A popular procedure for testing a pattern recognition machine is to 
present the machine with a set of patterns taken from the real world. The 
proportion of these patterns which are misrecognized or rejected is taken as 
the estimate of the error probability or rejection probability for the machine. 
In Part I, this testing procedure is discussed for the cases of unknown and 
known a priori probabilities of occurrence of the pattern classes. The 
differences between the tests that should be made in the two cases are noted, 
and confidence intervals for the test results are indicated. These concepts are 
applied to various published pattern recognition results by determining the 
appropriate confidence interval for each result. 

In Part II, the problem of the optimum partitioning of a sample of fixed 
size between the design and test phases of a pattern recognition machine is 
discussed. One important nonparametric result is that the proportion of the 
total sample used for testing the machine should never be less than that 
proportion used for designing the machine, and in some cases should be a 
good deal more. 

PART I — ON ANALYSIS 

INTRODUCTION 

There are two distinct and consecutive processes usually involved in 
the feasibility study of a pattern recognition method or machine. The 
first process is the actual design of the machine. This might be based 
upon a set of sample patterns which the experimenter has gathered, 
from which he estimates the parameters of the machine. Alternatively, 
the experimenter may base his design on some a priori knowledge con- 
cerning the pertinent characteristics of the pattern classes under study. 
The second process is then the testing of this machine, either in its 
hardware form or by its simulation on a general purpose computer. A 
different set of sample patterns from that used in the design is used in 
this stage. 

The popular procedure for interpreting the test results is to take the 
proportion of patterns in the test data which have been misrecognized 
or rejected by the machine as the estimates of the error probability and 
rejection probability, respectively, for the machine. There are several 
questions which might be raised concerning this testing procedure, such 
as: 

1. Are these estimates the best estimates? 

2. If so, how good are these estimates? 

3. How does the estimate improve as the sample size is increased? 

Questions such as these are discussed in Part I of this paper. Two 
cases are considered: one is the case in which the a priori probabilities 
of class occurrence are unknown, and the other case assumes full 
knowledge of the a priori probabilities. 

Case 1. Unknown a priori Probabilities — Random Sampling 

Let the number of allowable pattern classes be c. It will be assumed 
that, for each allowable class i, there exists an a priori probability of 
occurrence $\omega_i$, a probability of error $e_i$, and a probability of 
rejection $r_i$. (For the rest of this paper, the term "error" will refer to 
an undetected error; all detected errors will be assumed to be rejected.) 
These probabilities are unknown to the experimenter, who is interested in 
estimating the over-all probability of error for the machine, 

$$e = \sum_{i=1}^{c} \omega_i e_i, \tag{1}$$

and the over-all probability of rejection, 

$$r = \sum_{i=1}^{c} \omega_i r_i. \tag{2}$$

Let him perform the following experiment, which will be called random 
sampling. Consider the patterns to be randomly generated by a "pattern 
source" according to the a priori probabilities of occurrence. He takes a 
pattern from the source, identifies it, and then lets his pattern recogni- 
tion machine attempt identification. He notes which of the three possible 
outcomes occurs: correct recognition, misrecognition, or rejection. This 
experiment is repeated n times, resulting in $m_e$ samples which have been 
misrecognized and $m_r$ samples which have been rejected. 

Since these outcomes are mutually exclusive, and each experiment is 
independent, the resulting random variables, $m_e$ and $m_r$, clearly 
are distributed according to the multinomial probability distribution. 
That is, the joint probability distribution of $m_e$ and $m_r$, 
$P(m_e, m_r)$, is given by 

$$P(m_e, m_r) = \frac{n!}{m_e!\, m_r!\, (n - m_e - m_r)!}\; e^{m_e}\, r^{m_r}\, (1 - e - r)^{n - m_e - m_r}. \tag{3}$$



The maximum-likelihood estimates for e and r, denoted by $\hat{e}$ and 
$\hat{r}$, are then [1] 

$$\hat{e} = \frac{m_e}{n}, \qquad \hat{r} = \frac{m_r}{n}, \tag{4}$$



which are the estimates in common use. Further, each of these estimates 
is proportional to a single random variable having a binomial 
distribution; therefore, $n\hat{e}$ and $n\hat{r}$ are themselves binomially 
distributed. The mean value of each estimate is the parameter for which it 
is an estimate; the variance of each is [1] 

$$\sigma_{\hat{e}}^2 = \frac{e(1 - e)}{n}, \tag{6}$$

$$\sigma_{\hat{r}}^2 = \frac{r(1 - r)}{n}. \tag{7}$$

Because it is known that $n\hat{e}$ and $n\hat{r}$ are binomially distributed, 
confidence intervals can be applied to these estimates.* These confidence 
intervals require rather involved computations, but fortunately have 
been plotted for several values of n by various people. In Fig. 1 is 
shown such a plot of intervals for a 95 per cent confidence level computed 
by C. J. Clopper and E. S. Pearson. The use of this graph is fairly simple. 
A vertical line extended upward from the observed value of the estimate 
given on the abscissa will intersect the pair of curves pertaining to the 
particular sample size used. Projecting these two intersections 
horizontally onto the ordinate axis gives an interval for the parameter 
being estimated. The probability is 0.95 that the interval drawn in this 
manner includes the parameter. For instance, if a sample size of n = 250 
yielded 50 errors, then the estimate of the probability of error is 0.20. 
Using Fig. 1 it can be stated that, with probability 0.95, the true 
probability of error is included in the interval from 0.15 to 0.27. 
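
The graphical procedure can also be carried out numerically. The following 
is a minimal sketch, assuming the usual equal-tail definition of the 
Clopper-Pearson limits (each limit is the value of the parameter at which 
the observed count lies in a tail of probability 0.025). The function names 
are illustrative only and do not come from the text; the limits it returns 
for n = 250 and 50 errors should lie close to, though not exactly at, the 
values read from the chart. 

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X binomially distributed with parameters n and p."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo: float, hi: float, iters: int = 100) -> float:
    """Root of a monotone function f on [lo, hi] by repeated halving."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def clopper_pearson(m: int, n: int, confidence: float = 0.95):
    """Equal-tail exact interval for a binomial proportion, m errors in n trials."""
    alpha = 1.0 - confidence
    # Lower limit: p at which "m or more errors" has probability alpha/2.
    if m == 0:
        lower = 0.0
    else:
        lower = _bisect(lambda p: (1.0 - binom_cdf(m - 1, n, p)) - alpha / 2, 0.0, 1.0)
    # Upper limit: p at which "m or fewer errors" has probability alpha/2.
    if m == n:
        upper = 1.0
    else:
        upper = _bisect(lambda p: binom_cdf(m, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper


if __name__ == "__main__":
    # The worked case from the text: n = 250 samples, 50 errors.
    print(clopper_pearson(50, 250))
```

Bisection is used here only to keep the sketch free of external libraries; 
a statistics package offering the exact binomial interval would serve 
equally well. 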



* Mattson [2] has used a similar argument for determining convergence of an 
adaptive system. However, he used Tchebycheff's inequality to obtain confidence 
intervals which are necessarily larger than if he had used such intervals pertaining 
to the binomial distribution. 

Fig. 1 — 95 per cent confidence intervals for a binomially distributed variable. 

Case 2. Known a priori Probabilities — Selective Sampling 

It is now assumed that the a priori probability of occurrence for each 
class, $\omega_i$, is known. To take advantage of this knowledge, the 
experimenter takes $n_i$ samples from each class i such that 

$$\frac{n_i}{n} = \omega_i, \tag{8}$$

where n is the total number of samples. This process will be referred to 
as selective sampling.* (It will be assumed that the $\omega_i$ are such 
that (8) can be fulfilled with the desired sample size, n.) 

* This sort of sampling dichotomy has been previously noted by others. For 
instance, Bowley [5] and Neyman [6] have referred to these two methods as 
"unrestricted" and "stratified" sampling. 

The machine is again allowed to attempt recognition of these patterns, 
resulting in $m_{ei}$ samples from class i being misrecognized, and 
$m_{ri}$ samples from class i being rejected. 

For any class i, the joint probability distribution for $m_{ei}$ and 
$m_{ri}$ again is multinomial: 

$$P(m_{ei}, m_{ri}) = \frac{n_i!}{m_{ei}!\, m_{ri}!\, (n_i - m_{ei} - m_{ri})!}\; e_i^{m_{ei}}\, r_i^{m_{ri}}\, (1 - e_i - r_i)^{n_i - m_{ei} - m_{ri}}. \tag{9}$$

Since each of these distributions is independent of the others in this 
experiment, the joint probability of the outcome for all c classes is 
the product of the individual probabilities (9): 

$$P(m_{e1}, \ldots, m_{ec}, m_{r1}, \ldots, m_{rc}) = \prod_{i=1}^{c} \frac{n_i!}{m_{ei}!\, m_{ri}!\, (n_i - m_{ei} - m_{ri})!}\; e_i^{m_{ei}}\, r_i^{m_{ri}}\, (1 - e_i - r_i)^{n_i - m_{ei} - m_{ri}}. \tag{10}$$

This is no longer a multinomial probability distribution. However, since 
the maximum-likelihood estimate of a sum of independent variables is 
the sum of the maximum-likelihood estimates, these estimates for 
e and r are 

$$\hat{e} = \frac{1}{n} \sum_{i=1}^{c} m_{ei}, \tag{11}$$

$$\hat{r} = \frac{1}{n} \sum_{i=1}^{c} m_{ri}, \tag{12}$$

which again agree with the popular practice of using the proportions as 
estimates. The random variables of which $n\hat{e}$ and $n\hat{r}$ are values 
are not now binomially distributed, since a sum of binomially distributed 
variables is not itself binomially distributed in general. 

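The mean of each estimate is again the particular parameter being 
estimated. The variance of each of these estimates can be computed: 

$$\sigma_{\hat{e}}^{\prime 2} = \frac{1}{n^2} \sum_{i=1}^{c} \sigma_{m_{ei}}^2 = \frac{1}{n^2} \sum_{i=1}^{c} n_i e_i (1 - e_i) = \frac{1}{n} \sum_{i=1}^{c} \omega_i e_i (1 - e_i), \tag{13}$$

in which use of (8) is made, and the prime distinguishes this variance 
from that for random sampling. Similarly, 

$$\sigma_{\hat{r}}^{\prime 2} = \frac{1}{n} \sum_{i=1}^{c} \omega_i r_i (1 - r_i). \tag{14}$$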

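It is of interest to compare these variances for selective sampling 
with those obtained for the case of random sampling. Since the variance 
for $\hat{r}$ has the same form as that for $\hat{e}$ in both cases, it is 
necessary to consider only one of them, say $\hat{e}$. First note that 
$\sigma_{\hat{e}}^2$ can be written, using (1) and (6), as 

$$\sigma_{\hat{e}}^2 = \frac{1}{n} \Big( \sum_{i=1}^{c} \omega_i e_i \Big) \Big( 1 - \sum_{k=1}^{c} \omega_k e_k \Big). \tag{15}$$

From (13), 

$$\sigma_{\hat{e}}^2 - \sigma_{\hat{e}}^{\prime 2} = \frac{1}{n} \Big[ \sum_{i=1}^{c} \omega_i e_i^2 - \Big( \sum_{k=1}^{c} \omega_k e_k \Big)^2 \Big]. \tag{16}$$

Noting that $\sum_{i=1}^{c} \omega_i = 1$, (16) can be written as 

$$\sigma_{\hat{e}}^2 - \sigma_{\hat{e}}^{\prime 2} = \frac{1}{n} \sum_{i=1}^{c} \omega_i \Big( e_i - \sum_{k=1}^{c} \omega_k e_k \Big)^2 = \frac{1}{n} \sum_{i=1}^{c} \omega_i (e_i - e)^2 = \frac{1}{n}\, \sigma_{e_i}^2 \ge 0. \tag{17}$$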

Hence, the variance in the case of random sampling is greater than 
the variance in the case of selective sampling, the difference being what 
might be interpreted as the variance of the class errors. That is, if $e_i$ is 
treated as a random variable with probability distribution $\omega_i$, then 
$\sigma_{e_i}^2$ is the variance of $e_i$. (A similar derivation holds for the variance 
of the rejection probability estimates.) That the selective sampling 
variance should be smaller than the random sampling variance might 
be expected, since in selective sampling more information is used, namely 
the a priori probabilities. 
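
The comparison of the two variances can be checked numerically. The sketch 
below assumes nothing beyond the expressions for the random-sampling 
variance (15), the selective-sampling variance (13), and their difference 
(17); the function name and the two-class figures in the usage example are 
hypothetical and serve only as an illustration. 

```python
def sampling_variances(omega, e, n):
    """Variances of the error estimate under random and selective sampling.

    omega : a priori class probabilities (summing to one)
    e     : per-class error probabilities
    n     : total test sample size
    Returns (random-sampling variance, selective-sampling variance,
             variance of the class errors divided by n).
    """
    e_total = sum(w * ei for w, ei in zip(omega, e))               # over-all error, eq. (1)
    var_random = e_total * (1.0 - e_total) / n                     # eq. (15)
    var_selective = sum(w * ei * (1.0 - ei) for w, ei in zip(omega, e)) / n   # eq. (13)
    class_spread = sum(w * (ei - e_total) ** 2 for w, ei in zip(omega, e)) / n  # eq. (17)
    return var_random, var_selective, class_spread


if __name__ == "__main__":
    # Hypothetical two-class machine: equally likely classes with
    # error probabilities 0.1 and 0.3, tested on 100 samples.
    v_rand, v_sel, spread = sampling_variances([0.5, 0.5], [0.1, 0.3], 100)
    print(v_rand, v_sel, v_rand - v_sel, spread)
```

When all class error probabilities are equal the third quantity vanishes and 
the two sampling schemes give the same variance. 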

Although statements have been made concerning the mean and 
variance of the estimates in the selective sampling case, nothing has 
been said yet concerning confidence intervals. This is a much more 
complicated problem than that in the case of random sampling, since 
the estimates do not have a simple distribution function. In fact, the 
confidence intervals will in general depend on the particular set of 
$e_i$'s (or $r_i$'s) pertaining to the machine, and not simply on e (or r). 

However, for small probabilities, the binomial distribution is quite 
closely approximated by the Poisson distribution, the fit becoming 
perfect as the probability approaches zero. For any reasonable 
recognition machine, one would expect the probabilities of error and 
rejection to be small; consequently, the marginal form of (9) for $m_{ei}$ or $m_{ri}$ 
may be approximated by a Poisson distribution. The estimates given 
by (11) and (12) are now sums of random variables with Poisson 
distributions (approximately) which are then themselves Poisson 
distributed. If the over-all error is also small, as is usually the case, the 
binomial-Poisson approximation can now be used in reverse, and one 
may state that, for small error rates, the error and rejection estimates 
(11) and (12) are approximately binomially distributed. Consequently, 
one can use Fig. 1 to obtain 95 per cent confidence intervals for the 
error and rejection probabilities. Further, from (17), we would expect 
this confidence interval to be on the safe side, that is, the actual 95 
per cent confidence interval should be slightly smaller than this. 

APPLICATION TO PUBLISHED RESULTS 

To illustrate the ease of determining these confidence intervals, some 
published results in pattern recognition are listed in Table I along with 
the 95 per cent confidence intervals as determined from Fig. 1. It should 
be emphasized that Table I is not meant to compare one method against 
another, since the methods obviously treat problems of various com- 
plexities. Rather, the table is meant to compare the accuracies of the 
various evaluating experiments. 

Three points of caution should be noted concerning the validity of the 
confidence intervals in this table. First, the author is not positive that 
the test data is different from the design data in every case. Second, to 
the best of the author's knowledge, in every case the number of samples 
taken from each allowable pattern class was predetermined. This is 
selective sampling; therefore, it is assumed that the proportion of samples 
taken from each class represents its a priori probability of occurrence. 
The third assumption is that the patterns used to test the machine are 
a reasonable sampling from the real-life world of patterns, and are not 
biased toward either well-formed or poorly-formed (noisy) patterns. 

CONCLUSION 

Two important cases concerning the testing of pattern recognition 
methods or machines have been considered: Random sampling for the 
case of unknown a priori probabilities of class occurrence, and selective 
sampling for the case of known a priori probabilities. The predominant 
form of testing in the present-day art is to assume that the 
pattern classes have equal a priori probabilities of occurrence, and conse- 
quently to use equal sample sizes for each class; this is a special case of 
selective sampling. 

It has been shown that, for both cases, the maximum-likelihood esti- 
mate for the error probability or rejection probability is simply the 
proportion of samples misrecognized or rejected. In the case of random 
sampling, the estimates are binomially distributed, and accurate confi- 
dence intervals can be obtained. In the case of selective sampling, tighter 
estimates are obtained which are approximately binomially distributed 
for small error rates. Conservative confidence limits may then be 
obtained for these estimates. 

Using these notions, the experimenter can now determine the sample 
size required to obtain results which he deems significant. Alternatively, 
if he has a limited sample size, he can determine the significance of his 
results. Note that in both cases considered, the variance is inversely 
proportional to the sample size. This does not mean that the confidence 
interval is inversely proportional to the square root of the sample size, 
however, since a binomial rather than a normal distribution pertains. 
However, perusal of Fig. 1 seems to indicate that this is a good rule of 
thumb. Note also that the total number of samples required to obtain a 
certain confidence in the results seems to be independent of the number 
of allowable pattern classes. This is an interesting philosophical point 
to ponder. 
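
The rule of thumb mentioned above can be turned into a rough planning aid. 
The sketch below is an approximation only: it assumes the normal 
approximation to the binomial distribution, with 1.96 standard deviations 
taken for a 95 per cent level, whereas the exact intervals of Fig. 1 are 
binomial and not symmetric. The function name and the guessed error rate in 
the example are hypothetical. 

```python
from math import ceil


def approx_sample_size(e_guess: float, half_width: float, z: float = 1.96) -> int:
    """Approximate test sample size for a desired interval half-width.

    Uses the variance e(1 - e)/n of the estimate together with a normal
    approximation: half_width = z * sqrt(e(1 - e)/n), solved for n.
    """
    return ceil(e_guess * (1.0 - e_guess) * (z / half_width) ** 2)


if __name__ == "__main__":
    # A guessed error rate of 0.20, to be located within about +/- 0.05.
    print(approx_sample_size(0.20, 0.05))
    # Halving the half-width roughly quadruples the required sample.
    print(approx_sample_size(0.20, 0.025))
```

Because the variance depends on the unknown error rate, a pessimistic guess 
for that rate gives the safer sample size. 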

PART II — ON DESIGN 

INTRODUCTION 

Part I of this paper was concerned with the estimation of the 
performance of a given pattern recognition machine. There it was shown 
how confidence intervals could be found for these estimates. These 
results are nonparametric in that they hold for any categorization 
machine (or procedure) regardless of its structure. 

We now consider the following problem. An experimenter desires to 
solve a particular pattern recognition problem. He has at his disposal a 
set of different methods for solving this problem, but it is not clear to 
him which is the best to use. Consequently, he desires to estimate the 
performance of each method when applied to this problem, and choose 
the best. Let us assume that each method is characterized by certain 
key parameters which, when known, completely determine the recogni- 
tion machine. To evaluate any particular recognition method, the experi- 
menter plans to design the corresponding machine by estimating its 
parameters on the basis of one sampling from the real world of patterns, 
and then to test this machine based on another sampling (either by 
constructing the machine or by simulating it) . 

However, in many practical applications, the total sample size avail- 
able to the experimenter for design and test purposes is limited. For 
instance, he may be interested in building a machine to read hand- 
printed numbers, but he may not have an automatic scanner available 
to him. Since simulating a scanner by hand is very tedious, he may not 
be willing to scan more than a certain number of samples. 
Or, he may be interested in distinguishing between radar returns 
caused by missiles and those caused by decoys. Since it is expensive to 
actually run the sort of experiment required to gather data for this 
problem, budget limitations will certainly place a limit on the number 
of available samples. 

Another example is in the field of automatic diagnosis of diseases. 
The experimenter may, for instance, be interested in building a machine 
which would determine the presence of cancer based on a list of symp- 
toms. However, records have been maintained for only a certain number 
of people who have contracted this disease, and the sample size is thus 
definitely limited. 

The following problem then arises. If the total sample size is fixed, 
what is the optimum partitioning of this sample between the design and 
test phases? This is a rather loose, but concise, statement of the problem. 
A more accurate one follows. 

Assume that the experimenter is concerned with the study of a 
particular pattern recognition method as applied to some particular 
problem. The optimum pattern recognition machine based upon this method 
would have an error probability $e_0$. The experimenter is interested in 
estimating $e_0$ so that he can decide whether the particular method 
under study is adequate for the solution of his problem, or alternatively 
whether it is better than another method. To do this, he takes a sample 
of a certain size t from the real-life world of patterns. He desires to use 
part of this sample to design a machine according to the particular 
method under study. The machine which he thus designs will have an 
actual error probability $e \ge e_0$ (both quantities are unknown to the 
experimenter). He then uses the remaining part of his original sample 
to test the machine (according to the procedures of Part I). He thus 
obtains an estimate of e, which will be denoted by $\hat{e}$. It will be shown 
that $\hat{e}$ is a biased estimate of $e_0$, and that the bias can be computed. 
Consequently $\hat{e}$ can be adjusted so that it gives an unbiased estimate, 
$\hat{e}_0$, of $e_0$. The optimum partitioning of the total sample will be 
defined as that partitioning which minimizes the variance of $\hat{e}_0$. Thus, 
if the experimenter follows this procedure, he will obtain an unbiased 
minimum variance estimate of $e_0$, the optimum error probability. Of course, 
if he finally decides that a particular method is applicable, he can then 
redesign the corresponding machine with the entire sample size. 

OPTIMUM SAMPLE PARTITIONING 

We are interested, then, in minimizing the quantity 

$$\sigma_{\hat{e}_0}^2 = E[(\hat{e}_0 - e_0)^2] = E[\hat{e}_0^2] - e_0^2, \tag{18}$$

where E[x] and $\sigma_x^2$ denote the expected value and variance of x, 
respectively. 

Let us first digress and consider the biased estimate $\hat{e}$. Since $\hat{e}$ is 
discrete (it is the proportion of test samples misrecognized), its expected 
value can be written 

$$E[\hat{e}] = \sum \hat{e}\, p(\hat{e}),$$

where the summation is over all values of $\hat{e}$, and p(x) denotes the 
probability of x. But 

$$p(\hat{e}) = \int p(\hat{e} \mid e)\, p(e)\, de,$$

where $p(\hat{e} \mid e)$ is the probability of $\hat{e}$ given e, and the integral 
is over all (continuous) values of e (by definition $e_0 \le e \le 1$). Hence 

$$E[\hat{e}] = \sum \hat{e} \int p(\hat{e} \mid e)\, p(e)\, de = \int \Big[ \sum \hat{e}\, p(\hat{e} \mid e) \Big] p(e)\, de.$$

Let us henceforth consider only the case of random sampling. Then $\hat{e}$ 
is proportional to a binomially distributed variable ($n\hat{e}$) with parameter 
e. Therefore the term in brackets, which is the expected value of $\hat{e}$ 
given the parameter e, is just e. Then 

$$E[\hat{e}] = \int e\, p(e)\, de = E[e]. \tag{19}$$

E[e] is a function only of the parameters of the problem and the design 
sample size; it is not a random variable. 

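We next determine $E[\hat{e}^2]$. By going through a process analogous 
to the above, and by making use of (19), we obtain 

$$\sigma_{\hat{e}}^2 = E[(\hat{e} - E[\hat{e}])^2] = E[\hat{e}^2] - (E[\hat{e}])^2 = \frac{E[e(1 - e)]}{n},$$

where n is the size of the test sample. (Strictly, the variance of e about 
E[e] also contributes to $\sigma_{\hat{e}}^2$; that term is of second order in the 
design error and does not appear in this expression.) Hence 

$$E[\hat{e}^2] = \frac{E[e(1 - e)]}{n} + (E[e])^2. \tag{20}$$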

We now determine E[e]. Let the optimum machine be described by 
c different parameters $\delta_{0i}$, $1 \le i \le c$. The design of the machine 
consists of estimating the parameters $\delta_{0i}$ by making measurements on a 
set of sample patterns (the design sample). Let the estimates of these 
parameters be denoted $\delta_i$, $1 \le i \le c$. Then the error probability e 
of the resulting machine is a function of the estimates of the true 
parameters: 

$$e = e(\delta_1, \delta_2, \ldots, \delta_c).$$

One can now expand e in a Taylor series about its minimum point, $e_0$. 
Since this is a minimum point, all the coefficients of the linear terms 
will be zero. If the error deviation $(e - e_0)$ is small, terms above the 
second order term may be neglected: 

$$e = e_0 + \frac{1}{2} \sum_{i=1}^{c} \sum_{j=1}^{c} \frac{\partial^2 e}{\partial \delta_i\, \partial \delta_j} (\delta_i - \delta_{0i})(\delta_j - \delta_{0j}).$$

The expected value of the error for the resulting machine is then 

$$E[e] = e_0 + \frac{1}{2} \sum_{i=1}^{c} \sum_{j=1}^{c} \frac{\partial^2 e}{\partial \delta_i\, \partial \delta_j}\, E[(\delta_i - \delta_{0i})(\delta_j - \delta_{0j})].$$

If it is assumed that the estimates are unbiased, i.e., $E[\delta_i] = \delta_{0i}$, 
then the above equation may be written as 

$$E[e] = e_0 + \frac{1}{2} \sum_{i=1}^{c} \sum_{j=1}^{c} c_{ij} \sigma_{ij}, \tag{21}$$

where 

$$c_{ij} = c_{ji} = \frac{\partial^2 e}{\partial \delta_i\, \partial \delta_j},$$

$\sigma_{ij}$ is the covariance of the estimates for $\delta_i$ and $\delta_j$, and 
$\sigma_{ii} = \sigma_i^2$ is the variance of the estimate for $\delta_i$. Equation 
(21) is valid for small values of the quantity $(e - e_0)$. 
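
As a numerical illustration of (21), the following sketch evaluates the 
second-order expression for the expected design error from a matrix of 
second derivatives and a covariance matrix of the parameter estimates. All 
numbers in the usage example are hypothetical; the function name is not 
from the text. 

```python
def expected_design_error(e0, c, sigma):
    """Second-order approximation (21): E[e] = e0 + 1/2 * sum_ij c_ij * sigma_ij.

    e0    : error probability of the optimum machine
    c     : matrix of second derivatives of e at the optimum (list of rows)
    sigma : covariance matrix of the parameter estimates (list of rows)
    """
    k = len(c)
    excess = 0.5 * sum(c[i][j] * sigma[i][j] for i in range(k) for j in range(k))
    return e0 + excess


if __name__ == "__main__":
    # Hypothetical two-parameter machine with independent estimates whose
    # variances fall off as 1/m with the design sample size m.
    c = [[2.0, 0.0], [0.0, 4.0]]
    for m in (10, 20, 40):
        sigma = [[0.1 / m, 0.0], [0.0, 0.1 / m]]
        print(m, expected_design_error(0.05, c, sigma))
```

With variances proportional to 1/m, the excess over the optimum error halves 
each time the design sample is doubled. 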

It may be worth-while to digress here to a simple example which may 
help to clarify the definitions of the above terms. Zachary Oglethorpe 
is not only a crafty fisherman, but is also a good gadgeteer. He has 
decided to try to build equipment which will determine each day 
whether he should use a surface bait or a deep water bait in order to 
catch the maximum number of fish. He has means available to meas- 
ure the water temperature, the magnitude of surface ripple, and the 
atmospheric pressure, and therefore decides to use these as his 
measurements. He denotes values of these measurements by $m_1$, $m_2$, and 
$m_3$, respectively. 

Mr. Oglethorpe has been recording values of these measurements 
every day for the past six months, and has noted on each day whether 
he was more successful with surface or deep water bait. He thus has a 
total sample size of roughly 180 samples, some from one pattern class 
(surface bait), and some from the other pattern class (deep water bait). 
Since each sample was taken without a priori knowledge of the class 
to which it belonged, then this constitutes random sampling; that is, 
the proportion of samples in each class is an estimate of the a priori 
probability of occurrence of that class. 

Our crafty fisherman decides to build a decision making, or pattern 
recognition, machine by building a correlator for each of the two possi- 
ble decisions (or pattern classes). That is, the machine will make the 
following two calculations: 

$$\text{Surface bait} = \delta_1 m_1 + \delta_2 m_2 + \delta_3 m_3,$$

$$\text{Deep water bait} = \delta_4 m_1 + \delta_5 m_2 + \delta_6 m_3.$$

The class achieving the highest value represents the desired decision. 
Let us assume that, according to some theory, the optimum values of 
the $\delta_i$ are the means for each measurement within the appropriate 
pattern class, normalized so that the sum of the squares of the coefficients 
of each linear form is unity. That is, $\delta_1$ is proportional to the mean 
water temperature when surface bait should be used, and so forth, 
and is normalized with $\delta_2$ and $\delta_3$ so that 
$\delta_1^2 + \delta_2^2 + \delta_3^2 = 1$. 

Thus the parameters $\delta_i$ completely characterize this pattern 
recognition machine in that, given values for each $\delta_i$, $1 \le i \le 6$, 
the machine may be built. The optimum values for each $\delta_i$ are the 
appropriate normalized means, which are the $\delta_{0i}$ of the previous 
equations. Mr. Oglethorpe obtains estimates of these optimum parameters by 
taking normalized averages over a portion of the appropriate data. These 
estimates are the $\delta_i$ of the previous equations, and are the actual 
numbers on which he would base the construction of his machine. Note that, 
in this case, these estimates are unbiased and efficient, and may very 
well be independent of each other (e.g., the probability distribution of 
the water temperature when surface bait should be used may be inde- 
pendent of the values of surface ripple magnitude and atmospheric 
pressure). 
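
The design step of this example can be sketched in a few lines. The code 
below is only an illustration of the procedure as described: each class 
weight vector is the per-measurement average over that class's design 
samples, scaled to unit length, and the decision goes to the class with the 
larger correlation. The function names, the tie-breaking rule, and all data 
values are hypothetical. 

```python
from math import sqrt


def normalized_mean(samples):
    """Per-measurement average over the samples, scaled so the squares sum to one."""
    k = len(samples[0])
    mean = [sum(s[j] for s in samples) / len(samples) for j in range(k)]
    norm = sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean]


def design(surface_samples, deep_samples):
    """Estimate the six parameters from the design portion of the data."""
    return normalized_mean(surface_samples), normalized_mean(deep_samples)


def decide(weights, m):
    """Correlate the measurements m with each class and pick the larger score."""
    surface_w, deep_w = weights
    surface_score = sum(w * x for w, x in zip(surface_w, m))
    deep_score = sum(w * x for w, x in zip(deep_w, m))
    # Ties are given to surface bait; this choice is arbitrary.
    return "surface" if surface_score >= deep_score else "deep"


if __name__ == "__main__":
    # Made-up, pre-scaled measurements (temperature, ripple, pressure).
    surface = [(1.0, 0.0, 0.0), (0.8, 0.2, 0.0)]
    deep = [(0.0, 1.0, 0.0), (0.2, 0.8, 0.0)]
    weights = design(surface, deep)
    print(decide(weights, (0.9, 0.1, 0.0)))
```

The samples passed to `design` would be the design portion of the six months 
of records; the remainder would be held back for the test. 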

Having thus designed his fisherman's aid with a portion of his data, 
he now tests it with the remainder of the data to determine its accuracy. 
He does not want to use it if there is a good probability that it is less 
accurate than he has found his own intuition to be. This then leads 
us to the basic problem being studied: How should Zachary Oglethorpe 
split his total sample between the design and the testing of his machine 
to obtain the best estimate of the accuracy of the machine? Again, if 
the estimated accuracy of his machine were sufficient, he would then 
be wise to redesign it, basing the new design on the entire sample. 

We now return to the study of this sample partitioning. Let each 
parameter be estimated with m samples.* If each of these estimates is 
an efficient and unbiased estimate, and if the estimates are independent 
(either because the estimates are statistically independent, or because 
different samples are used to estimate each), then all $\sigma_{ij} = 0$, 
$i \ne j$, and all $\sigma_i^2$ will be proportional to 1/m. Hence one can 
rewrite (21) as 

$$E[e] = e_0 + \frac{b}{m}, \tag{22}$$

where b is some constant calculated from (21). (Often, E[e] is in the 
form (22) even if the estimates are not independent.) 

Let t be the total sample size, and p be the number of sets of m samples 
used to design the machine. p is chosen to be the smallest number 
which insures that E[e] is of the form (22). It is often simply the 
number of allowable pattern classes, since, of course, parameters of 
different classes must be estimated with different samples. If n is the 
test sample size, then 

$$t = n + pm. \tag{23}$$

From (19) and (22), 

$$E[\hat{e}] = E[e] = e_0 + \frac{b}{m}. \tag{24}$$

Consequently, $\hat{e}$ is a biased estimate of $e_0$. The adjusted estimate 
$\hat{e}_0$, given by 

$$\hat{e}_0 = \hat{e} - \frac{b}{m}, \tag{25}$$

is an unbiased estimate of $e_0$, with variance given by (18). This variance 
can now be rewritten using (25): 

$$\begin{aligned}
\sigma_{\hat{e}_0}^2 &= E[\hat{e}_0^2] - e_0^2 = E\Big[\Big(\hat{e} - \frac{b}{m}\Big)^2\Big] - e_0^2 \\
&= E[\hat{e}^2] - 2\,\frac{b}{m}\,E[\hat{e}] + \Big(\frac{b}{m}\Big)^2 - e_0^2.
\end{aligned}$$



* This is not always desirable, since some parameters may be easier to estimate 
than others, or there may be more data available for some parameters than 
others. However, this condition is assumed here for simplicity, as are the following 
assumptions of efficiency, unbiasedness, and independence. 

From (20) and (24), 

$$\begin{aligned}
\sigma_{\hat{e}_0}^2 &= \frac{E[e(1 - e)]}{n} + (E[e])^2 - 2\,\frac{b}{m}\Big(e_0 + \frac{b}{m}\Big) + \Big(\frac{b}{m}\Big)^2 - e_0^2 \\
&= \frac{E[e(1 - e)]}{n} + (E[e])^2 - \Big(e_0 + \frac{b}{m}\Big)^2.
\end{aligned}$$

Thus, from (24), 

$$\sigma_{\hat{e}_0}^2 = \frac{E[e(1 - e)]}{n}. \tag{26}$$

If $b/m \ll 1$ (which will certainly be true for any reasonable design), 
then 

$$\sigma_{\hat{e}_0}^2 \approx \frac{E[e(1 - e_0)]}{n} = (1 - e_0)\,\frac{e_0 + \dfrac{b}{m}}{n} = (1 - e_0)\,\frac{e_0 + \dfrac{pb}{t - n}}{n}, \tag{27}$$

where the relation (23) was used. 

We wish to choose n such that (27) is minimized. Differentiating 
(27) with respect to n and equating to zero, one obtains 

$$\frac{e_0 t}{pb} = \frac{2\,\dfrac{n_0}{t} - 1}{\Big(1 - \dfrac{n_0}{t}\Big)^2}, \tag{28}$$

where $n_0$ is that value of n satisfying (28); it is the optimum test sample 
size in the sense previously discussed. $n_0/t$ is of course the proportion of 
the total sample used for the test. One interesting result is immediately 
obvious: $n_0/t$ must be greater than 0.5 for all cases. The equation (28) is 
plotted in Fig. 2, from which the following general statements can be 
made. 

1. The proportion of the total sample that should be used to test 
the machine should never be less than 50 per cent. 

2. If $e_0 t/pb < 0.1$, then the proportion used for design should be 
about 50 per cent. 

3. The proportion of the total sample that should be used to test the 
machine becomes larger as: 

a. the total sample size increases, 

b. the error of the optimum machine increases, 

c. the effectiveness of the design increases (pb decreases). 

Here 1/pb is taken as a measure of the effectiveness of the design, 
since pb is the product of the expected deviation from optimum, 
$E[e - e_0]$, and the design sample size, pm. 

Fig. 2 — Optimum sample partitioning. 

These results indicate just how a sample should be split between the 
design and test stages of a feasibility study of a pattern recognition 
method. If the experimenter follows this procedure, he will obtain an 
estimate $\hat{e}_0$ of $e_0$ which is unbiased and has minimum variance. 
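
The partitioning rule lends itself to direct computation. Equation (28) is 
a quadratic in $n_0/t$, and one way to solve it (a rearrangement that is not 
given in the text) is $n_0/t = s/(1 + s)$ with $s = \sqrt{1 + e_0 t/pb}$. The 
sketch below assumes that $e_0$ and b are known or have been guessed in 
advance, which in practice they would not be exactly; the function names and 
numerical values are hypothetical. 

```python
from math import sqrt


def optimum_test_fraction(e0: float, t: float, p: float, b: float) -> float:
    """Fraction n0/t of the total sample to use for testing, from eq. (28)."""
    k = e0 * t / (p * b)          # left-hand side of (28)
    s = sqrt(1.0 + k)
    return s / (1.0 + s)          # root of (28) lying between 0.5 and 1


def variance_27(n: float, e0: float, t: float, p: float, b: float) -> float:
    """Approximate variance (27) of the adjusted estimate for test size n."""
    return (1.0 - e0) * (e0 + p * b / (t - n)) / n


def adjusted_estimate(e_hat: float, b: float, m: float) -> float:
    """Unbiased estimate (25) of the optimum error from the observed test error."""
    return e_hat - b / m


if __name__ == "__main__":
    # Hypothetical problem: optimum error 0.05, 1000 samples, two classes, b = 0.1.
    e0, t, p, b = 0.05, 1000, 2, 0.1
    frac = optimum_test_fraction(e0, t, p, b)
    n0 = frac * t
    m = (t - n0) / p              # design samples per set, from (23)
    print(frac, n0, m, variance_27(n0, e0, t, p, b))
```

In practice n and m must be whole numbers, so the computed split would be 
rounded to the nearest admissible partition. 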

The value of this minimum variance can be expressed as 

$$\sigma_{\hat{e}_0}^2\Big|_{\min} = \frac{e_0 (1 - e_0)}{n_0} \left[ 1 + \frac{1 - \dfrac{n_0}{t}}{2\,\dfrac{n_0}{t} - 1} \right],$$

which was obtained by eliminating pb between (27) and (28). Note that 
this is the variance that would have been obtained if the optimum 
machine were tested with $n_0$ samples, increased by a factor which accounts 
for the design error. 



AN EXAMPLE OF OPTIMUM SAMPLE PARTITIONING 

As an illustration of these ideas, consider the following example 
(perhaps the simplest of the n-dimensional problems). A pattern 
recognition machine is to be designed using the optimum decision function 
which will distinguish between q classes. The occurrence of each class is 
equally probable a priori, and all costs of misrecognition are the same. 
The receptor makes a set of k measurements $m_j$, $1 \le j \le k$, on each 
input pattern. It is known that each measurement is normally distributed 
with variance $\sigma^2$, and that all measurements are independent. 
Further, it is known that the distances between the mean vectors in 
measurement space* are all equal. (Consequently, there can be no 
more than k + 1 pattern classes. The tips of the mean vectors are the 
vertices of a regular polytope.) 

It can then be shown that the optimum decision function partitions 
the measurement space into polytopes which are bounded by those 
hyperplanes which are the perpendicular bisectors of the line segments 
joining all pairs of means. The hyperplane separating two classes, say 
classes 1 and 2, is the set of all points $(x_1, \ldots, x_k)$, represented 
by the vector $\bar{x}$, which satisfy 

$$\bar{x} \cdot (\bar{\mu}_1 - \bar{\mu}_2) = \tfrac{1}{2} (\bar{\mu}_1 \cdot \bar{\mu}_1 - \bar{\mu}_2 \cdot \bar{\mu}_2), \tag{29}$$

where $\bar{\mu}_i$ is the mean vector of class i [13]. 

The design procedure consists of estimating each mean vector from 
a sampling; denote the estimated mean vector for class i by $\bar{x}_i$. The 
distribution of the estimate of a mean vector from a normal distribution 
with covariance matrix [W] is also normal, with covariance matrix 
(1/m)[W], where m is the sample size used in the estimate [17]. Since the 
measurements are independent in this case, so will be the estimates 
of the means of the various measurements. Furthermore, each estimate 
will have a variance of $\sigma^2/m$. Consequently, only one set of samples of 
size m from each pattern class is required to insure that the form (22) is 
valid, and p is hence equal to the number of allowable pattern classes, q. 

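It is shown in the Appendix that b can be expressed in terms of $\Delta\mu$, 
the distance between any pair of mean vectors, and $N(\Delta\mu/2\sigma)$, 
the value of the standard normal density function for the variable 
$\Delta\mu/2\sigma$. With p = q, the equation (28) then becomes 

$$\frac{e_0 t}{qb} = \frac{2\,\dfrac{n_0}{t} - 1}{\Big(1 - \dfrac{n_0}{t}\Big)^2}, \tag{30}$$

with b taking the value given in the Appendix. 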



* A geometric interpretation of categorization problems is often useful. By 
measurement space, we mean a k-dimensional space in which each coordinate 
represents one of the k receptor measurements. Thus any set of measurements 
which have been made on an input pattern may be represented as a point in 
measurement space. The decision function may be thought of as partitioning the 
measurement space into regions corresponding to the different allowable pattern 
classes and into rejection regions. 
Fig. 3: Optimum sample partitioning for symmetric Gaussian case.

Some curves representing (30) are plotted in Fig. 3, in which the
proportion of the total sample to be used in the test, $n/t$, is shown as a
function of $t$, the total sample size, with the number of allowable pattern
classes, $q$, as a parameter. The value of $\epsilon$ was held constant at 0.05
(which involves the choosing of the proper value of $\Delta\mu/2\sigma$ for each $q$).

From Fig. 3 it is seen that, for many cases, the sample should be split 
evenly between design and test, as one might intuitively suspect. How- 
ever, there are some drastic deviations from this. For instance, if the 
categorizer is to separate only two classes, and 1000 samples are avail- 
able, then only 50 of these should be used to design the machine, and 
950 should be used to test it. Consequently, it is seen that intuition may 
go wrong in some cases. 



CONCLUSION 



This paper has begun an analysis of some of the problems which arise 
in the design and analysis of pattern recognition experiments. In Part II, 
the problem of optimum sample partitioning between the design and 
test phases of a pattern recognition machine was investigated for the 
case of a fixed total sample size and no overlap between the design and
test samples. The general relation between the optimum partitioning 
and the total sample size, optimum error rate, and design efficiency was 
derived. From this, it was apparent that the test sample size should 
never be smaller than the design sample size. These results are non- 
parametric in the sense that they do not depend on the detailed structure 
of the recognition machine. It is only necessary that the deviation of the 
designed machine from the optimum machine be small, and that the 
design of the machine be done in such a way that (22) holds. 

However, the actual computation of the optimum sample partitioning 
does depend strongly on the detailed structure of the machine through 
the quantity b. Since this computation is quite difficult even in the 
simplest of cases, the interesting question arises as to the possibility of 
estimating b from the sample. Another interesting phase of this problem 
which has not been attacked here concerns the case when the design 
sample and test sample overlap — that is, some of the sample patterns 
from the design sample are also used in the test sample. In the limit,
this reduces to using the total sample for both design and test purposes. 
In this case, the results of the test are usually not very reliable. Conse- 
quently, there may be some sample partitioning with overlap which is 
better (in the sense discussed in this paper) than for either the case of 
no overlap or the case of total overlap. 
APPENDIX

We determine here the coefficient b in (22) for the example discussed 
in this paper. If the mean vectors are more than about $3\sigma$ apart, then
only a small error is made if the total error is approximated by adding 
the errors of each hyperplane taken alone. That is, the integrals on the 
wrong side of the hyperplane that are counted more than once will be 
quite small compared to the integrals counted only once. 
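To get a feel for the size of the terms being added, the sketch below evaluates the error that a single hyperplane contributes for one class. For equal-variance normal classes whose means are $\Delta\mu$ apart, that is the normal tail area beyond $\Delta\mu/2\sigma$. The function names, and the choice to report a per-class sum over the $q - 1$ hyperplanes bounding that class, are illustrative assumptions rather than the paper's notation.

```python
import math


def tail_area(z: float) -> float:
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def single_hyperplane_error(delta_mu: float, sigma: float) -> float:
    """Probability that a class sample falls beyond one bisecting
    hyperplane, which lies delta_mu / 2 from the class mean."""
    return tail_area(delta_mu / (2.0 * sigma))


def summed_class_error(q: int, delta_mu: float, sigma: float) -> float:
    """Additive approximation for one class: the q - 1 hyperplanes
    bounding its region are treated separately, so regions that lie
    beyond more than one hyperplane are counted more than once."""
    return (q - 1) * single_hyperplane_error(delta_mu, sigma)


def hyperplane_count(q: int) -> int:
    """Number of pairwise hyperplanes among q classes."""
    return q * (q - 1) // 2
```

At a separation of $3\sigma$ the single-hyperplane term is about 0.067, so the doubly counted regions, which lie beyond two hyperplanes at once, are small compared with the terms counted once.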

Due to the symmetry of the problem, the error associated with each 
hyperplane for the optimum decision function is identical, and the deriva- 
tives of (21) will also be identical for each hyperplane. Since there are 
$q(q-1)/2$ hyperplanes, $b$ may be expressed (from (21) and (22)) as
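
$$b = \frac{q(q-1)}{2}\cdot\frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{2}\sigma^2\left.\frac{\partial^2\epsilon_{12}}{\partial\bar{x}_{ij}^2}\right|_{\bar{\mathbf{x}}_1=\boldsymbol{\mu}_1,\;\bar{\mathbf{x}}_2=\boldsymbol{\mu}_2}, \qquad (31)$$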

where the hyperplane separating classes 1 and 2 is taken as typical,
and the independence of the estimates is used. $\epsilon_{12}$ is the error associated
with this hyperplane, $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ are the mean vectors of these classes,
and $\bar{\mathbf{x}}_1$ and $\bar{\mathbf{x}}_2$ are the estimates of the mean vectors.

There is no loss in generality if $\boldsymbol{\mu}_1$ is taken as zero, and all the
components of $\boldsymbol{\mu}_2 = (\mu_{12}, \ldots, \mu_{k2})$ are taken as zero except for $\mu_{12}$. That is,

$$\boldsymbol{\mu}_1 = (0, 0, \ldots, 0)$$

$$\boldsymbol{\mu}_2 = (\mu, 0, \ldots, 0),$$

where $\mu_{12}$ is denoted $\mu$, $\mu > 0$. Consequently, the optimum boundary is
given by

$$x_1 = \mu/2.$$
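The boundary follows from substituting these two mean vectors into the hyperplane condition (29). Written out as a short check, using the same symbols:

```latex
\begin{aligned}
\mathbf{x}\cdot(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2)
  &= \tfrac{1}{2}\,(\boldsymbol{\mu}_1\cdot\boldsymbol{\mu}_1
     - \boldsymbol{\mu}_2\cdot\boldsymbol{\mu}_2) \\
-\mu\,x_1 &= -\tfrac{1}{2}\,\mu^{2} \\
x_1 &= \mu/2
\end{aligned}
```

Only the first coordinate survives because $\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 = (-\mu, 0, \ldots, 0)$.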

A sampling of size m is taken from each class, and the mean vectors 
are estimated, giving 

$$\bar{\mathbf{x}}_1 = (\bar{x}_{11}, \bar{x}_{21}, \ldots, \bar{x}_{k1})$$

$$\bar{\mathbf{x}}_2 = (\bar{x}_{12}, \bar{x}_{22}, \ldots, \bar{x}_{k2}).$$

A boundary given by (29) is computed based on the above estimates, 
and this, together with the other estimated boundaries, determines 
the structure of the machine. 

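The error $\epsilon_1$ associated with this particular boundary for class 1 is

$$\epsilon_1 = \int\!\cdots\!\int\!\int_{f_1(x_2,\ldots,x_k)}^{\infty}\frac{1}{(\sqrt{2\pi}\,\sigma)^{k}}\exp\!\left[-\frac{1}{2}\sum_{i=1}^{k}\left(\frac{x_i}{\sigma}\right)^{2}\right]dx_1\cdots dx_k,$$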

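where $f_1(x_2, \ldots, x_k)$ is the value of $x_1$ on the boundary, and is given by
(from (29))

$$f_1(x_2,\ldots,x_k) = -\sum_{i=2}^{k}\frac{\bar{x}_{i1}-\bar{x}_{i2}}{\bar{x}_{11}-\bar{x}_{12}}\,x_i + \frac{1}{2}\sum_{i=1}^{k}\frac{\bar{x}_{i1}^{2}-\bar{x}_{i2}^{2}}{\bar{x}_{11}-\bar{x}_{12}}$$

$$\phantom{f_1(x_2,\ldots,x_k)} = \frac{\bar{x}_{11}+\bar{x}_{12}}{2} - \frac{1}{2}\sum_{i=2}^{k}\frac{2(\bar{x}_{i1}-\bar{x}_{i2})\,x_i - (\bar{x}_{i1}^{2}-\bar{x}_{i2}^{2})}{\bar{x}_{11}-\bar{x}_{12}}.$$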
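The expression for the boundary value of $x_1$ is just (29) solved for the first coordinate, with the estimated means in place of the true ones. A minimal Python sketch of that algebra follows; the function names and the list-based vector representation are assumptions made for illustration.

```python
from typing import Sequence


def boundary_x1(xbar1: Sequence[float],
                xbar2: Sequence[float],
                rest: Sequence[float]) -> float:
    """Value of x_1 on the estimated boundary.

    xbar1, xbar2 are the estimated mean vectors of classes 1 and 2;
    rest holds the remaining coordinates x_2, ..., x_k of the point.
    """
    denom = xbar1[0] - xbar2[0]
    total = 0.0
    for i in range(1, len(xbar1)):
        a, b = xbar1[i], xbar2[i]
        total += 2.0 * (a - b) * rest[i - 1] - (a * a - b * b)
    return 0.5 * (xbar1[0] + xbar2[0]) - 0.5 * total / denom


def residual_29(x: Sequence[float],
                xbar1: Sequence[float],
                xbar2: Sequence[float]) -> float:
    """Left side minus right side of (29) with estimated means;
    zero when x lies on the estimated hyperplane."""
    lhs = sum(xi * (a - b) for xi, a, b in zip(x, xbar1, xbar2))
    rhs = 0.5 * (sum(a * a for a in xbar1) - sum(b * b for b in xbar2))
    return lhs - rhs
```

Prepending `boundary_x1(...)` to `rest` gives a point for which `residual_29` vanishes up to rounding, and with the exact means $(0, \ldots, 0)$ and $(\mu, 0, \ldots, 0)$ the function returns $\mu/2$.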
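Then

[expression for $\partial\epsilon_1/\partial\bar{x}_{i1}$ illegible in the source]

and

$$\left.\frac{\partial^{2}\epsilon_1}{\partial\bar{x}_{i1}^{2}}\right|_{\bar{\mathbf{x}}_1=\boldsymbol{\mu}_1,\;\bar{\mathbf{x}}_2=\boldsymbol{\mu}_2} = \frac{1}{\mu\sigma}\,N\!\left(\frac{\mu}{2\sigma}\right), \qquad 2 \le i \le k,$$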



where $N(\mu/2\sigma)$ is the value of the standard normal density function
for the variate $\mu/2\sigma$. In a like manner,



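$$\left.\frac{\partial^{2}\epsilon_2}{\partial\bar{x}_{i1}^{2}}\right|_{\bar{\mathbf{x}}_1=\boldsymbol{\mu}_1,\;\bar{\mathbf{x}}_2=\boldsymbol{\mu}_2} = -\frac{1}{\mu\sigma}\,N\!\left(-\frac{\mu}{2\sigma}\right) = -\frac{1}{\mu\sigma}\,N\!\left(\frac{\mu}{2\sigma}\right), \qquad 2 \le i \le k,$$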



where $\epsilon_2$ is the error associated with this boundary for class 2. Since the
total error for this boundary is $\epsilon_{12} = \epsilon_1 + \epsilon_2$, then



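$$\frac{\partial^{2}\epsilon_{12}}{\partial\bar{x}_{i1}^{2}} = \frac{\partial^{2}\epsilon_1}{\partial\bar{x}_{i1}^{2}} + \frac{\partial^{2}\epsilon_2}{\partial\bar{x}_{i1}^{2}} = 0, \qquad 2 \le i \le k.$$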



A like result holds for $\partial^{2}\epsilon_{12}/\partial\bar{x}_{i2}^{2}$, $2 \le i \le k$. Going through this same
procedure for $\bar{x}_{11}$,



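$$\left.\frac{\partial^{2}\epsilon_1}{\partial\bar{x}_{11}^{2}}\right|_{\bar{\mathbf{x}}_1=\boldsymbol{\mu}_1,\;\bar{\mathbf{x}}_2=\boldsymbol{\mu}_2} = \frac{\mu}{8\sigma^{3}}\,N\!\left(\frac{\mu}{2\sigma}\right).$$

Similarly,

$$\left.\frac{\partial^{2}\epsilon_2}{\partial\bar{x}_{11}^{2}}\right|_{\bar{\mathbf{x}}_1=\boldsymbol{\mu}_1,\;\bar{\mathbf{x}}_2=\boldsymbol{\mu}_2} = \frac{\mu}{8\sigma^{3}}\,N\!\left(\frac{\mu}{2\sigma}\right).$$

Hence

$$\left.\frac{\partial^{2}\epsilon_{12}}{\partial\bar{x}_{11}^{2}}\right|_{\bar{\mathbf{x}}_1=\boldsymbol{\mu}_1,\;\bar{\mathbf{x}}_2=\boldsymbol{\mu}_2} = \frac{\mu}{4\sigma^{3}}\,N\!\left(\frac{\mu}{2\sigma}\right).$$

It would also be found that

$$\left.\frac{\partial^{2}\epsilon_{12}}{\partial\bar{x}_{12}^{2}}\right|_{\bar{\mathbf{x}}_1=\boldsymbol{\mu}_1,\;\bar{\mathbf{x}}_2=\boldsymbol{\mu}_2} = \frac{\mu}{4\sigma^{3}}\,N\!\left(\frac{\mu}{2\sigma}\right).$$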

This analysis is perfectly general for arbitrary mean vectors, provided
that $\mu$ is merely interpreted as the distance between a pair of mean
vectors (all such distances being assumed identical). This distance will
henceforth be written $\Delta\mu$ to indicate that it is a difference of means.
Therefore, from (31), we find that

$$b = \frac{q(q-1)}{4}\,\frac{\Delta\mu}{2\sigma}\,N\!\left(\frac{\Delta\mu}{2\sigma}\right).$$
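For numerical work, the coefficient can be evaluated directly from this closed form, $b = \tfrac{1}{4}\,q(q-1)\,(\Delta\mu/2\sigma)\,N(\Delta\mu/2\sigma)$, with $N$ the standard normal density. The sketch below assumes that reading of the final expression; the function and parameter names are hypothetical.

```python
import math


def normal_density(z: float) -> float:
    """Standard normal density N(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def coefficient_b(q: int, delta_mu: float, sigma: float) -> float:
    """Coefficient b for q equally spaced classes.

    q        : number of allowable pattern classes
    delta_mu : common distance between any pair of class means
    sigma    : common standard deviation of each measurement
    """
    z = delta_mu / (2.0 * sigma)
    return q * (q - 1) / 4.0 * z * normal_density(z)
```

For fixed $\Delta\mu/2\sigma$ the coefficient scales with $q(q-1)$, that is, with the number of hyperplanes, so going from two classes to four multiplies $b$ by six.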